For this project you will be doing the Bike Sharing Demand Kaggle challenge! We won't submit any results to the competition, but feel free to explore Kaggle in more depth. The main point of this project is to get you comfortable with Exploratory Data Analysis and to start building an understanding that some models are simply not a good choice for a given data set. In this case, we will discover that Linear Regression may not be the best choice for our data!
Just complete the tasks outlined below.
You can download the data or just use the supplied csv in the repository. The data has the following features:
Read in the bikeshare.csv file and assign it to a dataframe called bike.
bike <- read.csv('bikeshare.csv')
Check the head of the bike dataframe.
head(bike)
Can you figure out which column is the target we are trying to predict? Check the Kaggle link above if you are unsure.
# Count is what we are trying to predict
Create a scatter plot of count vs temp. Set a good alpha value.
library(ggplot2)
ggplot(bike,aes(temp,count)) + geom_point(alpha=0.2, aes(color=temp)) + theme_bw()
Plot count versus datetime as a scatterplot with a color gradient based on temperature. You'll need to convert the datetime column into POSIXct before plotting.
bike$datetime <- as.POSIXct(bike$datetime)
ggplot(bike,aes(datetime,count)) + geom_point(aes(color=temp),alpha=0.5) + scale_color_continuous(low='#55D8CE',high='#FF6E2E') +theme_bw()
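A note on the conversion step: called without a format, as.POSIXct() tries a few standard layouts and uses your local time zone. Being explicit can make the result easier to reproduce. The sketch below assumes the timestamps look like "YYYY-MM-DD HH:MM:SS"; the sample strings are made up for illustration and the UTC time zone is an assumption, not something the data set specifies.

```r
# Hypothetical timestamps, assumed to match the layout used in bikeshare.csv
raw <- c("2011-01-01 00:00:00", "2011-01-01 13:00:00")

# Spell out the layout and time zone instead of relying on the defaults
parsed <- as.POSIXct(raw, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

class(parsed)          # date-time class rather than character
format(parsed, "%H")   # hour component as zero-padded text
```

If a string does not match the given format, as.POSIXct() returns NA for it, so checking anyNA(parsed) after converting is a cheap sanity test.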
Hopefully you noticed two things: the data is seasonal, with clear differences between winter and summer, and bike rental counts are increasing overall. This may be a problem for a linear regression model, because the relationship in the data is non-linear. Let's have a quick overview of the pros and cons of Linear Regression:
Pros:
Cons:
We'll keep this in mind as we continue. Maybe when we learn more algorithms we can come back to this with some new tools. For now, we'll stick to Linear Regression.
What is the correlation between temp and count?
cor(bike[,c('temp','count')])
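To help read that output, here is a small sketch on R's built-in mtcars data (a stand-in, not the bike data). Passing two columns to cor() returns a 2x2 matrix whose off-diagonal entry is the correlation you are after, and for a one-feature linear model the R-squared is the square of that correlation.

```r
# mtcars ships with R; wt and mpg stand in for temp and count here
m <- cor(mtcars[, c("wt", "mpg")])   # 2x2 correlation matrix
r <- cor(mtcars$wt, mtcars$mpg)      # the same value as a single number

# R-squared of a single-feature linear model equals the squared correlation
r2 <- summary(lm(mpg ~ wt, data = mtcars))$r.squared

m
r^2
r2
```

The same pattern applies to bike: cor(bike$temp, bike$count) gives the single number directly.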
Let's explore the season data. Create a boxplot with count on the y axis and one box per season on the x axis.
ggplot(bike,aes(factor(season),count)) + geom_boxplot(aes(color=factor(season))) +theme_bw()
Notice what this says:
We know these issues come from the overall growth in rental count, not from the actual season!
A lot of times you'll need to use domain knowledge and experience to engineer and create new features. Let's go ahead and engineer some new features from the datetime column.
Create an "hour" column that takes the hour from the datetime column. You'll probably need to apply some function to the entire datetime column and reassign it. Hint:
time.stamp <- bike$datetime[4]
format(time.stamp, "%H")
bike$hour <- sapply(bike$datetime,function(x){format(x,"%H")})
head(bike)
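As an aside, format() is already vectorized over date-time vectors, so the hour can also be extracted without sapply(). This is a minimal sketch on two made-up timestamps (the UTC time zone is an assumption); it is an alternative to the solution above, not a correction of it.

```r
# Hypothetical timestamps for illustration only
stamps <- as.POSIXct(c("2011-01-01 05:00:00", "2011-01-01 17:00:00"), tz = "UTC")

# format() works on the whole vector at once and returns character values
hour_chr <- format(stamps, "%H")

# Convert to numbers when the hour is needed as a numeric feature
hour_num <- as.numeric(hour_chr)

hour_chr
hour_num
```

On the real data this would read bike$hour <- format(bike$datetime, "%H").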
Now create a scatterplot of count versus hour, with color scale based on temp. Only use bike data where workingday==1.
Optional Additions:
library(dplyr)
pl <- ggplot(filter(bike,workingday==1),aes(hour,count))
pl <- pl + geom_point(position=position_jitter(w=1, h=0),aes(color=temp),alpha=0.5)
pl <- pl + scale_color_gradientn(colours = c('dark blue','blue','light blue','light green','yellow','orange','red'))
pl + theme_bw()
Now create the same plot for non working days:
pl <- ggplot(filter(bike,workingday==0),aes(hour,count))
pl <- pl + geom_point(position=position_jitter(w=1, h=0),aes(color=temp),alpha=0.8)
pl <- pl + scale_color_gradientn(colours = c('dark blue','blue','light blue','light green','yellow','orange','red'))
pl + theme_bw()
You should have noticed that working days have peak activity in the morning (~8am) and right after work gets out (~5pm), with some lunchtime activity. Non-working days instead show a steady rise and fall centered on the afternoon.
Now let's continue by trying to build a model. We'll begin by looking at just a single feature.
Use lm() to build a model that predicts count based solely on the temp feature, name it temp.model
temp.model <- lm(count~temp,bike)
Get the summary of the temp.model
summary(temp.model)
You should have gotten 6.0462 as the intercept and 9.17 as the temp coefficient. How can you interpret these values? Do some Wikipedia research, re-read ISLR, or revisit the Linear Regression lecture notebook for more on this.
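One way to check your interpretation is to ask the model directly. The sketch below fits a one-feature model on synthetic data (the toy data frame, its seed, and its generating numbers are invented for illustration, not taken from bikeshare.csv) and shows that the intercept is the prediction at temp = 0, while the slope is the change in the prediction for a one-degree increase.

```r
# Synthetic stand-in data; NOT the bike data
set.seed(1)
toy <- data.frame(temp = seq(0, 40, by = 0.5))
toy$count <- 6 + 9 * toy$temp + rnorm(nrow(toy), sd = 20)

toy.model <- lm(count ~ temp, data = toy)
b <- coef(toy.model)

# Intercept: the predicted count when temp is 0
at_zero <- unname(predict(toy.model, data.frame(temp = 0)))

# Slope: how much the prediction moves when temp rises by one degree
step <- unname(predict(toy.model, data.frame(temp = 21)) -
               predict(toy.model, data.frame(temp = 20)))

all.equal(at_zero, b[["(Intercept)"]])
all.equal(step, b[["temp"]])
```

Both comparisons hold for any model of this form, whatever the data, which is what makes the two coefficients interpretable.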
How many bike rentals would we predict if the temperature was 25 degrees Celsius? Calculate this two ways:
You should get around 235.3 bikes.
# Method 1
6.0462 + 9.17*25
# Method 2
temp.test <- data.frame(temp=c(25))
predict(temp.model,temp.test)
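Method 1 uses rounded coefficients, so it only approximately matches predict(). Pulling the coefficients out with coef() keeps full precision. Here is a hedged sketch on synthetic data (toy, its seed, and its generating numbers are made up for illustration):

```r
# Synthetic stand-in data; NOT the bike data
set.seed(7)
toy <- data.frame(temp = runif(200, min = 0, max = 40))
toy$count <- 6 + 9 * toy$temp + rnorm(200, sd = 25)

toy.model <- lm(count ~ temp, data = toy)
b <- coef(toy.model)

# Method 1 without rounding: intercept + slope * temperature
manual <- b[["(Intercept)"]] + b[["temp"]] * 25

# Method 2: let predict() do the arithmetic
auto <- unname(predict(toy.model, data.frame(temp = 25)))

all.equal(manual, auto)
```

With the real model you would swap toy.model for temp.model and leave the rest unchanged.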
Use sapply() and as.numeric to change the hour column to a column of numeric values.
bike$hour <- sapply(bike$hour,as.numeric)
Finally, build a model that attempts to predict count based on the following features. Figure out if there's a way to avoid writing all of these variables out in the lm() call. Hint: StackOverflow or Google may be quicker than the documentation.
model <- lm(count ~ . - casual - registered - datetime - atemp, bike)
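In a formula, . stands for every column other than the response, and - drops a term again. A small sketch on the built-in mtcars data (a stand-in, not the bike data) shows that the shorthand and the written-out formula produce the same model:

```r
# Shorthand: every column except mpg, minus cyl and disp
full <- lm(mpg ~ . - cyl - disp, data = mtcars)

# The same model with every remaining predictor written out
explicit <- lm(mpg ~ hp + drat + wt + qsec + vs + am + gear + carb, data = mtcars)

# Which predictors ended up in the shorthand model
attr(terms(full), "term.labels")

all.equal(coef(full), coef(explicit))
```

Checking the term labels this way is a quick guard against accidentally leaving a column such as casual or registered in the model.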
Get the summary of the model
summary(model)
Did the model perform well on the training data? What do you think about using a Linear Model on this data?
A linear model fit with OLS, like the one we chose, can't account for the seasonality in our data. It also gets thrown off by the growth in our dataset, attributing it to the winter season instead of recognizing that overall demand is simply growing! Later on, we'll see whether other models are a better fit for this sort of data.
You should have noticed that this sort of model doesn't work well on our seasonal, time series data. We need a model that can account for this type of trend. Read about Regression Forests for more info if you're interested! For now, let's keep this in mind as a learning experience and move on to classification with Logistic Regression!
Optional: See how well you can predict for future data points by creating a train/test split. But instead of a random split, your split should be "future" data for test, "previous" data for train.
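Here is a minimal sketch of such a split, assuming an 80/20 cut by time. It runs on synthetic hourly data (the toy data frame, the seed, the cutoff fraction, and the choice of RMSE as the error measure are all illustrative assumptions); on the real data you would use bike with its converted datetime column.

```r
# Synthetic hourly data standing in for the bike data
set.seed(42)
toy <- data.frame(
  datetime = seq(as.POSIXct("2011-01-01 00:00:00", tz = "UTC"),
                 by = "hour", length.out = 1000)
)
toy$temp <- runif(nrow(toy), min = 5, max = 35)
toy$count <- round(10 + 8 * toy$temp + rnorm(nrow(toy), sd = 30))

# Sort by time, then cut at the 80% mark: earlier rows train, later rows test
toy <- toy[order(toy$datetime), ]
cutoff <- toy$datetime[floor(0.8 * nrow(toy))]
train <- toy[toy$datetime <= cutoff, ]
test <- toy[toy$datetime > cutoff, ]

# Fit on the past, evaluate on the future
fit <- lm(count ~ temp, data = train)
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$count - pred)^2))
rmse
```

Because every test row is later than every training row, the error reflects how the model handles data it could not have seen, which a random split would hide.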